diffusion model: backgrounds

20221210 suyako diffusion model

evaluation metrics

  • inception score (IS)

    A high score requires two properties of the Inception-v3 predictions on generated images (see the computation sketch below):

    • for each individual generated image, the entropy of the predicted label distribution is low, i.e. the image is confidently recognizable as some class

    • the marginal label distribution, aggregated over all generated images, is evenly spread across all possible labels, i.e. the samples are diverse


  • FID (Fréchet Inception Distance)

    a metric that measures the distance between the distribution of Inception feature vectors of real images and that of generated images; concretely, the Fréchet (Wasserstein-2) distance between Gaussians fitted to the two feature sets, which can be read as the minimum cost of transporting one distribution into the other. Lower FID is better.

    (figure: example of how increasing distortion of an image correlates with a higher FID score)
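
Both metrics are computed from Inception-v3 outputs. Below is a minimal numpy/scipy sketch of how they are typically computed, assuming the class probabilities and pool features have already been extracted into arrays named `probs`, `feats_real`, and `feats_fake` (these names are illustrative); in practice an established implementation should be preferred, since FID is sensitive to preprocessing details.

```python
# Minimal sketch of Inception Score (IS) and FID, assuming Inception-v3
# class probabilities / pool features are already available as numpy arrays.
import numpy as np
from scipy.linalg import sqrtm


def inception_score(probs, eps=1e-12):
    """probs: (N, 1000) softmax outputs of Inception-v3 on generated images."""
    p_y = probs.mean(axis=0, keepdims=True)                 # marginal label distribution p(y)
    kl = probs * (np.log(probs + eps) - np.log(p_y + eps))  # KL(p(y|x) || p(y)) per image
    return float(np.exp(kl.sum(axis=1).mean()))             # IS = exp(E_x KL)


def fid(feats_real, feats_fake):
    """feats_*: (N, 2048) Inception-v3 pool features of real / generated images."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    cov_mean = sqrtm(cov_r @ cov_f)                          # matrix square root
    if np.iscomplexobj(cov_mean):                            # drop tiny imaginary parts
        cov_mean = cov_mean.real
    # ||mu_r - mu_f||^2 + Tr(cov_r + cov_f - 2 (cov_r cov_f)^{1/2})
    return float(np.sum((mu_r - mu_f) ** 2) + np.trace(cov_r + cov_f - 2 * cov_mean))
```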

DDPM

forward process

Given a data point sampled from a real data distribution $\mathbf{x}_0 \sim q(\mathbf{x})$, define a forward diffusion process in which we add a small amount of Gaussian noise to the sample in $T$ steps, producing a sequence of noisy samples $\mathbf{x}_1, \dots, \mathbf{x}_T$. The step sizes are controlled by a variance schedule $\{\beta_t \in (0, 1)\}_{t=1}^{T}$:

$$q(\mathbf{x}_t \vert \mathbf{x}_{t-1}) = \mathcal{N}\big(\mathbf{x}_t;\ \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\ \beta_t\mathbf{I}\big), \qquad q(\mathbf{x}_{1:T} \vert \mathbf{x}_0) = \prod_{t=1}^{T} q(\mathbf{x}_t \vert \mathbf{x}_{t-1})$$

The data sample gradually loses its distinguishable features as the step $t$ becomes larger. Eventually, when $T \to \infty$, $\mathbf{x}_T$ is equivalent to an isotropic Gaussian distribution.


reparameterization

Let $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$. Then $\mathbf{x}_t$ can be sampled at any arbitrary time step in closed form:

$$\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),$$

i.e. $q(\mathbf{x}_t \vert \mathbf{x}_0) = \mathcal{N}\big(\mathbf{x}_t;\ \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\ (1-\bar{\alpha}_t)\mathbf{I}\big)$.
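
As a concrete illustration of this closed form, here is a minimal PyTorch sketch (not from the paper) that samples $\mathbf{x}_t$ directly from $\mathbf{x}_0$; the linear beta schedule and the $(B, C, H, W)$ tensor shape are assumptions made for the example.

```python
# Minimal sketch of the closed-form forward sampling q(x_t | x_0),
# assuming a linear beta schedule and image tensors of shape (B, C, H, W).
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # beta_1 ... beta_T
alphas = 1.0 - betas                           # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)      # \bar{alpha}_t = prod_i alpha_i


def q_sample(x0, t, noise=None):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I)."""
    if noise is None:
        noise = torch.randn_like(x0)
    abar = alpha_bars.to(x0.device)[t].view(-1, 1, 1, 1)   # broadcast over image dims
    return abar.sqrt() * x0 + (1 - abar).sqrt() * noise
```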

reverse process

If $\beta_t$ is small enough, $q(\mathbf{x}_{t-1} \vert \mathbf{x}_t)$ will also be Gaussian. However, it cannot be easily estimated, since it depends on the entire data distribution; therefore we need to learn a model $p_\theta$ to approximate these conditional probabilities:

$$p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t) = \mathcal{N}\big(\mathbf{x}_{t-1};\ \boldsymbol{\mu}_\theta(\mathbf{x}_t, t),\ \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)\big)$$

It is noteworthy that the reverse conditional probability is tractable when conditioned on $\mathbf{x}_0$: $q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}\big(\mathbf{x}_{t-1};\ \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0),\ \tilde{\beta}_t\mathbf{I}\big)$. Following the standard Gaussian density function, the mean and variance can be parameterized as follows:

$$\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t, \qquad \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) = \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,\mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,\mathbf{x}_0$$

loss

The model is trained by optimizing a variational lower bound on the log-likelihood. The objective can be further rewritten as a combination of several KL-divergence and entropy terms:

$$L_{\text{VLB}} = \underbrace{D_{\text{KL}}\big(q(\mathbf{x}_T \vert \mathbf{x}_0)\,\|\,p_\theta(\mathbf{x}_T)\big)}_{L_T} + \sum_{t=2}^{T} \underbrace{D_{\text{KL}}\big(q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)\,\|\,p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t)\big)}_{L_{t-1}} \ \underbrace{-\ \log p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)}_{L_0}$$

  • $L_T$ is a constant during training, since the variances $\beta_t$ are fixed and $\mathbf{x}_T$ is pure Gaussian noise; it can be ignored.

  • $L_{t-1}$ compares two Gaussian distributions and can be computed in closed form. It is greatly simplified by conditioning $q$ on $\mathbf{x}_0$ and setting $\boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)$ to untrained time-dependent constants ($\sigma_t^2 = \beta_t$ or $\sigma_t^2 = \tilde{\beta}_t$).

  • $L_0$ is modeled by a separate discrete decoder derived from the Gaussian distribution $\mathcal{N}\big(\mathbf{x}_0;\ \boldsymbol{\mu}_\theta(\mathbf{x}_1, 1),\ \sigma_1^2\mathbf{I}\big)$.

simplified loss

Ho et al. found it beneficial to sample quality (and simpler to train) to use the following variant of the variational bound:

$$L_{\text{simple}}(\theta) = \mathbb{E}_{t,\,\mathbf{x}_0,\,\boldsymbol{\epsilon}}\Big[\big\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\big(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon},\ t\big)\big\|^2\Big]$$

The $t = 1$ case corresponds to $L_0$, with the integral in the discrete decoder approximated by the Gaussian probability density function times the bin width, ignoring $\sigma_1^2$ and edge effects.

The $t > 1$ cases correspond to an unweighted version of $L_{t-1}$, which down-weights loss terms corresponding to small $t$, where the amount of noise is small, so that training focuses on the more difficult denoising tasks at larger $t$.

training and sampling

Training:

1: repeat
2: $\mathbf{x}_0 \sim q(\mathbf{x}_0)$
3: $t \sim \mathrm{Uniform}(\{1, \dots, T\})$
4: $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
5: Take gradient descent step on $\nabla_\theta \big\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\big(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon},\ t\big)\big\|^2$
6: until converged

Sampling:

1: $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
2: for $t = T, \dots, 1$ do
3: $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ if $t > 1$, else $\mathbf{z} = \mathbf{0}$
4: $\mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}}\Big(\mathbf{x}_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\Big) + \sigma_t \mathbf{z}$
5: end for
6: return $\mathbf{x}_0$
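
The two algorithms above translate almost line-for-line into code. The following hedged PyTorch sketch reuses `T`, `betas`, `alphas`, `alpha_bars`, and `q_sample` from the earlier snippet; `eps_model(x_t, t)` is an assumed noise-prediction network (e.g. a time-conditioned U-Net), and $\sigma_t^2 = \beta_t$ is used as the reverse variance.

```python
# Hedged sketch of DDPM training (Algorithm 1) and sampling (Algorithm 2),
# reusing T, betas, alphas, alpha_bars, q_sample defined above;
# eps_model(x_t, t) is an assumed noise-prediction network.
import torch
import torch.nn.functional as F


def train_step(eps_model, optimizer, x0):
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)   # t ~ Uniform({1..T})
    noise = torch.randn_like(x0)                                 # eps ~ N(0, I)
    x_t = q_sample(x0, t, noise)
    loss = F.mse_loss(eps_model(x_t, t), noise)                  # L_simple
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


@torch.no_grad()
def sample(eps_model, shape, device="cpu"):
    x = torch.randn(shape, device=device)                        # x_T ~ N(0, I)
    for t in reversed(range(T)):
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        eps = eps_model(x, torch.full((shape[0],), t, device=device))
        coef = (1 - alphas[t]) / (1 - alpha_bars[t]).sqrt()
        x = (x - coef * eps) / alphas[t].sqrt() + betas[t].sqrt() * z   # sigma_t^2 = beta_t
    return x
```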

NCSN

introduction

In order to build a generative model, we first need a way to represent a probability distribution. One such way, as in likelihood-based models, is to directly model the probability density function (p.d.f.). We can define a p.d.f. via $p_\theta(\mathbf{x}) = \frac{e^{f_\theta(\mathbf{x})}}{Z_\theta}$, where $f_\theta$ is a learnable real-valued function. Here the normalizing constant $Z_\theta = \int e^{f_\theta(\mathbf{x})}\,\mathrm{d}\mathbf{x}$ is an intractable quantity for any general $f_\theta$, which makes likelihood-based training difficult. However, we can sidestep this difficulty by modeling the score function $\nabla_\mathbf{x}\log p(\mathbf{x})$ instead of the density function, since $\nabla_\mathbf{x}\log p_\theta(\mathbf{x}) = \nabla_\mathbf{x} f_\theta(\mathbf{x})$ does not involve $Z_\theta$.
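
A tiny PyTorch sketch of why modeling the score sidesteps the normalizing constant: for $p_\theta(\mathbf{x}) \propto e^{f_\theta(\mathbf{x})}$, the score is just $\nabla_\mathbf{x} f_\theta(\mathbf{x})$ and never touches $Z_\theta$. The small network `f` below is an arbitrary stand-in.

```python
# The score of p_theta(x) ∝ exp(f_theta(x)) is the gradient of f_theta(x)
# with respect to x; the intractable Z_theta is constant in x and drops out.
import torch

f = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1))

x = torch.randn(5, 2, requires_grad=True)
log_p_unnorm = f(x).sum()                           # log p_theta(x) up to the constant log Z_theta
score = torch.autograd.grad(log_p_unnorm, x)[0]     # ∇_x log p_theta(x), shape (5, 2)
print(score.shape)
```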

Langevin dynamics

Langevin dynamics provides an MCMC procedure to sample from a distribution $p(\mathbf{x})$ using only its score function $\nabla_\mathbf{x}\log p(\mathbf{x})$. Specifically, it initializes the chain from an arbitrary prior distribution $\mathbf{x}_0 \sim \pi(\mathbf{x})$, and then iterates the following:

$$\mathbf{x}_{i+1} \leftarrow \mathbf{x}_i + \epsilon\,\nabla_\mathbf{x}\log p(\mathbf{x}_i) + \sqrt{2\epsilon}\,\mathbf{z}_i, \qquad \mathbf{z}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),\quad i = 0, 1, \dots, K$$

When $\epsilon \to 0$ and $K \to \infty$, $\mathbf{x}_K$ obtained from this procedure converges to a sample from $p(\mathbf{x})$ under some regularity conditions. In practice, the error is negligible when $\epsilon$ is sufficiently small and $K$ is sufficiently large.
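
A minimal numpy sketch of this procedure, sampling a standard 2-D Gaussian using only its analytic score $\nabla_\mathbf{x}\log p(\mathbf{x}) = -\mathbf{x}$; the step size and number of iterations are illustrative choices.

```python
# Sketch of Langevin dynamics: sample a standard 2-D Gaussian from its score.
import numpy as np

rng = np.random.default_rng(0)
eps, K = 1e-2, 5000
x = rng.uniform(-8, 8, size=(2000, 2))                 # x_0 from an arbitrary prior
for _ in range(K):
    score = -x                                         # ∇_x log N(x; 0, I) = -x
    x = x + eps * score + np.sqrt(2 * eps) * rng.standard_normal(x.shape)

print(x.mean(axis=0), x.var(axis=0))                   # ≈ [0, 0] and ≈ [1, 1]
```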

manifold hypothesis

TODO

low data density regions

  • inaccurate score estimation with score matching

In regions of low data density, score matching may not have enough evidence to estimate score functions accurately, due to the lack of data samples.


  • slow mixing of Langevin dynamics

Consider a mixture distribution $p_{\text{data}}(\mathbf{x}) = \pi\,p_1(\mathbf{x}) + (1-\pi)\,p_2(\mathbf{x})$, where $p_1$ and $p_2$ are normalized distributions with disjoint supports. Within the support of $p_1$ the score is $\nabla_\mathbf{x}\log p_1(\mathbf{x})$, and within the support of $p_2$ it is $\nabla_\mathbf{x}\log p_2(\mathbf{x})$; in either case the score does not depend on the mixture weight $\pi$. In practice, this analysis also holds when different modes have approximately disjoint supports: they may share the same support but be connected by regions of small data density. Now, Langevin dynamics can produce correct samples in theory, but it may require a very small step size and a very large number of steps to mix.

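
A hedged 1-D illustration of this slow-mixing issue: even with the exact score of a $0.9/0.1$ mixture of two well-separated Gaussians, Langevin dynamics with a practical step budget keeps roughly the $0.5/0.5$ split of its initialization instead of recovering the true mode weights.

```python
# Exact score of a 0.9/0.1 Gaussian mixture; Langevin from a uniform prior
# recovers each mode's shape but not the relative weights within this budget.
import numpy as np

rng = np.random.default_rng(0)
pi, mu1, mu2 = 0.9, -5.0, 5.0

def score(x):                                    # ∇_x log [pi N(x;mu1,1) + (1-pi) N(x;mu2,1)]
    p1 = pi * np.exp(-0.5 * (x - mu1) ** 2)
    p2 = (1 - pi) * np.exp(-0.5 * (x - mu2) ** 2)
    return (p1 * (mu1 - x) + p2 * (mu2 - x)) / (p1 + p2)

eps, K = 1e-2, 2000
x = rng.uniform(-8, 8, size=10000)
for _ in range(K):
    x = x + eps * score(x) + np.sqrt(2 * eps) * rng.standard_normal(x.shape)

print((x < 0).mean())                            # ≈ 0.5, far from the true weight 0.9
```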

multiple noise perturbations

In order to bypass the difficulty of accurate score estimation in regions of low data density, it is natural to perturb data points with noise and train score-based models on the noisy data points instead. There is a trade-off, however: larger noise covers more low-density regions and thus gives better score estimation, but it alters the data significantly from the original distribution; smaller noise causes less corruption of the original data distribution, but does not cover the low-density regions.

To achieve the best of both worlds, NCSN uses multiple scales of noise perturbation simultaneously. Suppose there are a total of $L$ noise scales with increasing standard deviations $\sigma_1 < \sigma_2 < \dots < \sigma_L$; the $i$-th noise-perturbed distribution is

$$p_{\sigma_i}(\mathbf{x}) = \int p_{\text{data}}(\mathbf{y})\,\mathcal{N}\big(\mathbf{x};\ \mathbf{y},\ \sigma_i^2\mathbf{I}\big)\,\mathrm{d}\mathbf{y}$$

The training objective for the noise-conditional score network $\mathbf{s}_\theta(\mathbf{x}, \sigma)$ is a weighted sum of Fisher divergences over all noise scales:

$$\sum_{i=1}^{L} \lambda(\sigma_i)\,\mathbb{E}_{p_{\sigma_i}(\mathbf{x})}\Big[\big\|\nabla_\mathbf{x}\log p_{\sigma_i}(\mathbf{x}) - \mathbf{s}_\theta(\mathbf{x}, \sigma_i)\big\|_2^2\Big]$$

After training $\mathbf{s}_\theta(\mathbf{x}, \sigma)$, samples can be produced by running Langevin dynamics for $\sigma_L, \sigma_{L-1}, \dots, \sigma_1$ in sequence; this is called annealed Langevin dynamics because the noise scale is gradually decreased.
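
A hedged PyTorch sketch of annealed Langevin dynamics, assuming a trained noise-conditional score network `score_model(x, sigma)`; `sigmas` is a 1-D tensor of increasing noise scales as above, and the step size $\alpha_i \propto \sigma_i^2$ follows the NCSN recipe.

```python
# Sketch of annealed Langevin dynamics with an assumed score network
# score_model(x, sigma); sigmas is a 1-D tensor with sigma_1 < ... < sigma_L.
import torch


@torch.no_grad()
def annealed_langevin(score_model, shape, sigmas, steps_per_scale=100, eps=2e-5):
    x = torch.rand(shape)                                  # arbitrary prior, e.g. U[0, 1)
    for sigma in sigmas.flip(0):                           # anneal sigma_L -> sigma_1
        alpha = eps * (sigma / sigmas[0]) ** 2             # step size proportional to sigma_i^2
        for _ in range(steps_per_scale):
            z = torch.randn_like(x)
            x = x + 0.5 * alpha * score_model(x, sigma) + alpha.sqrt() * z
    return x
```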

learning NCSN via score matching

similarities and differences between DDPM and NCSN

differences

TODO

similarities

  • the schedule of increasing noise levels in NCSN resembles the forward diffusion process in DDPM. If we use the diffusion perturbation $q(\mathbf{x}_t \vert \mathbf{x}_0) = \mathcal{N}\big(\mathbf{x}_t;\ \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\ (1-\bar{\alpha}_t)\mathbf{I}\big)$ and recall that $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}$, we can get

$$\nabla_{\mathbf{x}_t}\log q(\mathbf{x}_t \vert \mathbf{x}_0) = -\frac{\mathbf{x}_t - \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0}{1-\bar{\alpha}_t} = -\frac{\boldsymbol{\epsilon}}{\sqrt{1-\bar{\alpha}_t}},$$

    so the DDPM noise predictor and the NCSN score network are related by $\mathbf{s}_\theta(\mathbf{x}_t, t) = -\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\,/\,\sqrt{1-\bar{\alpha}_t}$.

guided diffusion

In addition to employing well-designed architectures, GANs for conditional image synthesis make heavy use of class labels. This often takes the form of class-conditional normalization statistics, as well as discriminators with heads that are explicitly designed to behave like classifiers.

Given this observation for GANs, it makes sense to explore different ways to condition diffusion models on class labels. Guided diffusion exploits a classifier to improve a diffusion generator: a pre-trained diffusion model can be conditioned using the gradients of a classifier. In particular, we can train a classifier $p_\phi(y \vert \mathbf{x}_t, t)$ on noisy images $\mathbf{x}_t$, and then use the gradients $\nabla_{\mathbf{x}_t}\log p_\phi(y \vert \mathbf{x}_t, t)$ to guide the diffusion sampling process towards an arbitrary class label $y$.

To condition an unconditional reverse noising process on a label $y$, it suffices to sample each transition according to

$$p_{\theta,\phi}(\mathbf{x}_{t-1} \vert \mathbf{x}_t, y) \propto p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t)\,p_\phi(y \vert \mathbf{x}_{t-1}).$$

Recall $p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t) = \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ and approximate $\log p_\phi(y \vert \mathbf{x}_{t-1})$ by a Taylor expansion around $\mathbf{x}_{t-1} = \boldsymbol{\mu}$. We thus find that the conditional transition operator can be approximated by a Gaussian similar to the unconditional transition operator, but with its mean shifted by $\boldsymbol{\Sigma}\,\mathbf{g}$, where $\mathbf{g} = \nabla_{\mathbf{x}_{t-1}}\log p_\phi(y \vert \mathbf{x}_{t-1})\big|_{\mathbf{x}_{t-1} = \boldsymbol{\mu}}$. The final sampling algorithm is as follows:

(figure: classifier-guided diffusion sampling algorithm)
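
A hedged PyTorch sketch of one classifier-guided reverse step in the spirit of the algorithm above, reusing the DDPM schedule tensors (`betas`, `alphas`, `alpha_bars`) from the earlier sketches; `eps_model`, `classifier(x, t)`, and the guidance scale `s` are assumed names, and the reverse variance is fixed to $\sigma_t^2 = \beta_t$ for simplicity.

```python
# One classifier-guided reverse step: compute the unconditional DDPM mean,
# then shift it by s * Sigma * ∇_x log p_phi(y | x_t) before adding noise.
import torch
import torch.nn.functional as F


def guided_step(eps_model, classifier, x, t, y, s=1.0):
    """x: current x_t, t: python int time step, y: LongTensor of target labels."""
    t_batch = torch.full((x.shape[0],), t, device=x.device, dtype=torch.long)

    # gradient of the classifier log-probability w.r.t. the noisy input x_t
    with torch.enable_grad():
        x_in = x.detach().requires_grad_(True)
        log_probs = F.log_softmax(classifier(x_in, t_batch), dim=-1)
        selected = log_probs[range(len(y)), y].sum()
        grad = torch.autograd.grad(selected, x_in)[0]       # ∇_x log p_phi(y | x_t)

    # unconditional DDPM mean, same as step 4 of the sampling algorithm earlier
    eps = eps_model(x, t_batch)
    coef = (1 - alphas[t]) / (1 - alpha_bars[t]).sqrt()
    mean = (x - coef * eps) / alphas[t].sqrt()

    sigma2 = betas[t]                                        # fixed reverse variance
    z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
    return mean + s * sigma2 * grad + sigma2.sqrt() * z      # mean shifted by s * Sigma * g
```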